replace globby with much faster code when gitignore enabled #426

Merged (6 commits) on Jan 11, 2024

Contributor

phiresky commented Jan 3, 2024

The current dir walking method has two issues:

  1. gitignores are searched for and parsed for every single plugin for every single workspace package. This search is very inefficient and walks the full directory tree, without any early exiting / pruning.
  2. the gitignores are applied in memory only after walking (again, no early exiting / pruning).

The reason for both is in the globby() calls.

On every globby(patterns) call, previously this happened:

  1. The full repo was recursively walked with a fast-glob glob of '**/.gitignore' to find the ignore files.
  2. All the ignore files were parsed.
  3. The fast-glob call was executed with the patterns given in the config (not the ones from the gitignore files!).
  4. Only afterwards were the ignores from the gitignore files filtered out in memory (in a slightly buggy way).

globby is just a thin wrapper around fast-glob, and the only globby feature knip uses is the gitignore function, which has caused multiple issues in the past (That's the reason the docs recommend trying to turn off gitignore detection for performance afaik?).

With this PR, instead this happens on a globby() call:

  1. The repo is walked using @nodelib/fs.walk (which is what fast-glob uses internally), incrementally building the gitignore list. Files in gitignored directories are not walked. fast-glob itself doesn't support dynamic ignores (Feature request: dynamically ignore files or directories mrmlnc/fast-glob#265).

  2. The gitignore list is cached per passed cwd, so it is only ever computed and parsed once.

    Since I don't think there's a lib that actually supports collecting multiple gitignore files, I wrote a small gitignore->micromatch converter. This converter is probably not perfect, but the globby solution is also wrong (I can give details if needed).

  3. fast-glob is called with the patterns and ignore list from knip itself combined with the ignore list from the gitignore files. This causes fast-glob to also do early pruning which makes walking here much faster as well.
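
The incremental walk from step 1 can be sketched roughly as follows. This is a minimal illustration using only Node's built-in `fs` module, not the PR's actual code: the helper names `readIgnoredNames` and `walk` are hypothetical, and the rule handling is deliberately simplified to plain directory names (no globs, no negation).

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

/** Hypothetical simplification: only plain names such as "dist" or "backup/" are understood. */
function readIgnoredNames(gitignorePath: string): string[] {
  return fs
    .readFileSync(gitignorePath, 'utf8')
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => line !== '' && !line.startsWith('#') && !line.startsWith('!'))
    .map(line => line.replace(/\/$/, ''));
}

/** Walks `dir`, extending the ignore set before descending, so ignored subtrees are never read. */
function walk(dir: string, ignored: ReadonlySet<string>, visited: string[] = []): string[] {
  const entries = fs.readdirSync(dir, { withFileTypes: true });
  // Rules from this directory's .gitignore apply to everything below it.
  const local = new Set(ignored);
  if (entries.some(entry => entry.isFile() && entry.name === '.gitignore')) {
    for (const name of readIgnoredNames(path.join(dir, '.gitignore'))) local.add(name);
  }
  for (const entry of entries) {
    if (!entry.isDirectory() || local.has(entry.name)) continue; // early pruning
    const child = path.join(dir, entry.name);
    visited.push(child);
    walk(child, local, visited);
  }
  return visited;
}

// Demo on a throwaway directory: "backup" is gitignored, so "backup/deep" is never visited.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'walk-'));
fs.mkdirSync(path.join(root, 'src'));
fs.mkdirSync(path.join(root, 'backup', 'deep'), { recursive: true });
fs.writeFileSync(path.join(root, '.gitignore'), 'backup/\n');
const visitedDirs = walk(root, new Set(['.git'])).map(dir => path.relative(root, dir));
console.log(visitedDirs); // [ 'src' ]
```

The point of the sketch is the ordering: the `.gitignore` in a directory is read before any of its subdirectories are entered, so a huge ignored directory costs a single set lookup rather than a full traversal.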

This should fix #425 and reverse the recommendation to set gitignore: false. Instead, knip should now be faster when ignoring more files, which makes sense.

| (Rough) Benchmark | Before | After |
| --- | --- | --- |
| My repo with a huge gitignored directory inside | 30+ min (didn't wait for it to finish), > 200 million syscalls | 2:02 min |
| My repo with just build artifacts | 3:20 min | 1:50 min |

The reason I'm submitting this here and not to globby itself is that (a) globby is written in JS not TS and (b) has multiple other functions that would probably be affected but knip doesn't care about.
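
For readers wondering what a gitignore-to-glob converter like the one mentioned in step 2 involves, here is a rough sketch. It is not the converter from this PR: the function name `convertGitignoreLine` and the `ConvertedRule` type are made up for illustration, and escapes such as `\#` or `\!` and significant leading whitespace are ignored. It only follows the documented gitignore rule that a pattern containing a slash at the start or in the middle is relative to the .gitignore's directory, while other patterns may match at any depth below it.

```typescript
type ConvertedRule = { negated: boolean; patterns: string[] };

/**
 * Converts one .gitignore line into glob patterns relative to the walk root.
 * `base` is the POSIX-style directory of the .gitignore file relative to that root ('' for the root).
 */
function convertGitignoreLine(line: string, base: string): ConvertedRule | undefined {
  let pattern = line.trim();
  if (pattern === '' || pattern.startsWith('#')) return undefined; // blank line or comment
  const negated = pattern.startsWith('!');
  if (negated) pattern = pattern.slice(1);
  // A trailing slash only restricts the rule to directories; drop it for matching.
  if (pattern.endsWith('/')) pattern = pattern.slice(0, -1);
  // A slash at the start or in the middle anchors the rule to the .gitignore's directory.
  const anchored = pattern.includes('/');
  if (pattern.startsWith('/')) pattern = pattern.slice(1);
  const prefix = base === '' ? '' : `${base}/`;
  const glob = anchored ? `${prefix}${pattern}` : `${prefix}**/${pattern}`;
  // Match the entry itself and everything underneath it.
  return { negated, patterns: [glob, `${glob}/**`] };
}

console.log(convertGitignoreLine('node_modules/', ''));
console.log(convertGitignoreLine('/dist', 'packages/a'));
console.log(convertGitignoreLine('!keep.log', ''));
```

Returning negated rules separately is what lets a caller keep two lists (ignores and unignores) and hand them to a matcher as patterns plus an `ignore` option.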


Collaborator

webpro commented Jan 4, 2024

Hi @phiresky, this is fantastic! 🔥 Thank you so much for doing this.

This is such a huge win, happy to replace globby and merge your PR. I don't have an issue with maintaining that here. Just a few points:

Contributor Author

phiresky commented Jan 4, 2024

I've rebased this on master.

> Not sure why npx knip behaves like it does on Windows

If that's only happening since this branch then it's probably related to some path joining doing something with \ and not /. Globby I think has some logic to convert \ to / so there's probably some inconsistency now. knip --debug should reveal it but it's annoying to test without Windows.

> great if this PR fixes EACCES errors too.

It should fix access errors for files / directories that are in ignored directories yes. If there's an access error in a file that's part of the repo then no. (But that seems like it probably shouldn't be ignored anyways)

> to remove that "performance boost" section

Note that this can still be slower than gitignore: false if the repo is fully clean anyway, since (a) it still does an additional directory traversal to find gitignore files and (b) the number of patterns passed to fast-glob is larger, especially if the gitignore files are large. A clean repo rarely happens locally, but in a CI environment it's quite possible.

I don't think the impact should be much though because it's a single scan that (on my larger repo) only takes <1s. fast-glob optimizes patterns I think by e.g. parsing them into path segments and matching those faster but not sure.

Collaborator

webpro commented Jan 4, 2024

> I've rebased this on master.

Thanks!

>> Not sure why npx knip behaves like it does on Windows

> If that's only happening since this branch then it's probably related to some path joining doing something with \ and not /. Globby I think has some logic to convert \ to / so there's probably some inconsistency now. knip --debug should reveal it but it's annoying to test without Windows.

I have a Windows laptop right here, I'll look into it.

>> great if this PR fixes EACCES errors too.

> It should fix access errors for files / directories that are in ignored directories yes. If there's an access error in a file that's part of the repo then no. (But that seems like it probably shouldn't be ignored anyways)

Yeah that's what I meant, at least users can ignore problematic dirs/files.

>> to remove that "performance boost" section

> Note that this can still be slower than gitignore: false if the repo is fully clean anyway, since (a) it still does an additional directory traversal to find gitignore files and (b) the number of patterns passed to fast-glob is larger, especially if the gitignore files are large. A clean repo rarely happens locally, but in a CI environment it's quite possible.

> I don't think the impact should be much though because it's a single scan that (on my larger repo) only takes <1s. fast-glob optimizes patterns I think by e.g. parsing them into path segments and matching those faster but not sure.

Thanks, I'll benchmark a bit. But the overall win is a no-brainer 🙏

On a related note, have you seen https://github.com/fabiospampinato/fast-ignore? If not, might have some good ideas as well.

Collaborator

webpro commented Jan 4, 2024

Easy fix up for Windows, to be continued.. Just for the record:

diff --git a/packages/knip/src/util/globby.ts b/packages/knip/src/util/globby.ts
index 1fc3ef1e..2b9fb378 100644
--- a/packages/knip/src/util/globby.ts
+++ b/packages/knip/src/util/globby.ts
@@ -37,7 +37,7 @@ function convertGitignoreToMicromatch(pattern: string, base: string) {
 
 function parseGitignoreFile(filePath: string, cwd: string) {
   const file = fs.readFileSync(filePath, 'utf8');
-  const base = path.relative(cwd, path.dirname(filePath));
+  const base = path.relative(cwd, path.dirname(path.toPosix(filePath)));
 
   return file
     .split(/\r?\n/)
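
To illustrate why the separator matters here: glob patterns use `/`, while Node on Windows produces `\` in relative paths. The snippet below is a small sketch using Node's built-in `path.win32` API so it behaves the same on any OS. The `toPosix` function is a hypothetical stand-in, not knip's own helper from `./path.js`, and it normalizes the result rather than the input as the diff above does.

```typescript
import * as path from 'node:path';

/** Hypothetical stand-in for a toPosix helper: swaps Windows separators for '/'. */
const toPosix = (value: string): string => value.split(path.win32.sep).join(path.posix.sep);

// path.win32 behaves like Node on Windows regardless of the host OS.
const cwd = 'C:\\repo';
const gitignoreFile = 'C:\\repo\\packages\\app\\.gitignore';
const windowsBase = path.win32.relative(cwd, path.win32.dirname(gitignoreFile));
const posixBase = toPosix(windowsBase);

console.log(windowsBase); // packages\app
console.log(posixBase); // packages/app
```

A base of `packages\app` prepended to a gitignore rule would produce a pattern that never matches the forward-slash paths the glob matcher sees, which fits the Windows symptoms discussed above.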

Collaborator

webpro commented Jan 4, 2024

When trying this on other repos, I noticed that some were indeed faster, but others were, surprisingly, slower than before.

A couple of things I tried that seem to work well:

  • Using tiny-readdir-glob, which is specialized for this use case, is faster than @nodelib/fs.walk.
  • The deepFilter works great, but ignores should be pre-populated with node_modules, otherwise the gains are basically lost. Fortunately tiny-readdir-glob offers the same option.

I've benchmarked by wrapping e.g. timerify(parseFindGitignores) and then knip --performance. The results are pretty significant (50-80% improvement for this particular function on various repos). I'm really curious if you see improvements on your workloads too.
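
For context, wrapping a function to collect call counts and total time (the `size` and `sum` columns in the tables further down) can be sketched like this. This is not knip's `timerify` implementation; `timed` and `callStats` are hypothetical names, and the wrapper only handles synchronous functions (an async function would need its promise awaited before the clock is stopped).

```typescript
import { performance } from 'node:perf_hooks';

type CallStats = { size: number; sum: number };
const callStats = new Map<string, CallStats>();

/** Hypothetical helper (not knip's timerify): records call count and total milliseconds per name. */
function timed<A extends unknown[], R>(name: string, fn: (...args: A) => R): (...args: A) => R {
  return (...args: A): R => {
    const start = performance.now();
    try {
      return fn(...args);
    } finally {
      // Runs whether fn returned or threw, so failed calls are counted too.
      const entry = callStats.get(name) ?? { size: 0, sum: 0 };
      entry.size += 1;
      entry.sum += performance.now() - start;
      callStats.set(name, entry);
    }
  };
}

// Placeholder workload standing in for a function such as parseFindGitignores.
const countSlashes = timed('countSlashes', (p: string) => p.split('/').length - 1);
const depth = countSlashes('packages/knip/src') + countSlashes('a/b');
console.log(depth, callStats.get('countSlashes'));
```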

WDYT?

diff --git a/packages/knip/package.json b/packages/knip/package.json
index f6fc237d..d9b5dc16 100644
--- a/packages/knip/package.json
+++ b/packages/knip/package.json
@@ -54,7 +54,6 @@
   ],
   "dependencies": {
     "@ericcornelissen/bash-parser": "0.5.2",
-    "@nodelib/fs.walk": "2.0.0",
     "@npmcli/map-workspaces": "3.0.4",
     "@npmcli/package-json": "5.0.0",
     "@pkgjs/parseargs": "0.11.0",
@@ -71,6 +70,7 @@
     "pretty-ms": "8.0.0",
     "strip-json-comments": "5.0.1",
     "summary": "2.1.0",
+    "tiny-readdir-glob": "1.2.1",
     "zod": "3.22.4",
     "zod-validation-error": "2.1.0"
   },
diff --git a/packages/knip/src/util/globby.ts b/packages/knip/src/util/globby.ts
index 2b9fb378..8300d42c 100644
--- a/packages/knip/src/util/globby.ts
+++ b/packages/knip/src/util/globby.ts
@@ -1,13 +1,12 @@
 import * as fs from 'fs';
-import { promisify } from 'node:util';
-import { walk as _walk } from '@nodelib/fs.walk';
 import { type Options as FastGlobOptions } from 'fast-glob';
 import fastGlob from 'fast-glob';
 import micromatch from 'micromatch';
+import readdir from 'tiny-readdir-glob';
+import { GLOBAL_IGNORE_PATTERNS } from '../constants.js';
 import { debugLogObject } from './debug.js';
 import * as path from './path.js';
 
-const walk = promisify(_walk);
 type Options = {
   /** Respect ignore patterns in `.gitignore` files that apply to the globbed files. */
   readonly gitignore?: boolean;
@@ -48,23 +47,19 @@ type Gitignores = { ignores: string[]; unignores: string[] };
 
 /** walks a directory, parsing gitignores and using them directly on the way (early pruning) */
 async function parseFindGitignores(options: Options): Promise<Gitignores> {
-  const ignores: string[] = [];
+  const ignores: string[] = ['.git', ...GLOBAL_IGNORE_PATTERNS];
   const unignores: string[] = [];
-  const consideredFiles: string[] = [];
-  await walk(options.cwd, {
-    entryFilter: entry => {
-      if (entry.dirent.isFile() && entry.name === '.gitignore') {
-        consideredFiles.push(entry.path);
-        for (const rule of parseGitignoreFile(entry.path, options.cwd))
-          if (rule.negated) unignores.push(...rule.patterns);
-          else ignores.push(...rule.patterns);
-        return true;
-      }
-      return false;
-    },
-    deepFilter: entry => !micromatch.any(path.relative(options.cwd, entry.path), ignores, { ignore: unignores }),
+  const result = await readdir('**/.gitignore', {
+    cwd: options.cwd,
+    ignore: (entry: string) => micromatch.any(path.relative(options.cwd, entry), ignores, { ignore: unignores }),
   });
-  debugLogObject(options.cwd, 'gitignore files', { consideredFiles, ignores, unignores });
+  for (const filePath of result.files) {
+    for (const rule of parseGitignoreFile(filePath, options.cwd)) {
+      if (rule.negated) unignores.push(...rule.patterns);
+      else ignores.push(...rule.patterns);
+    }
+  }
+  debugLogObject(options.cwd, 'gitignore files', { consideredFiles: result.files, ignores, unignores });
   return { ignores, unignores };
 }
 const cachedIgnores = new Map<string, Gitignores>();
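
The `cachedIgnores` map on the last line of the diff is what keeps the scan to once per cwd. A minimal sketch of that idea follows. It differs from the diff in one assumed detail: it stores the pending promise rather than the resolved value, so concurrent callers share a single scan. `scanForGitignores` and `loadGitignores` are hypothetical names, and the scan body is a placeholder.

```typescript
type Gitignores = { ignores: string[]; unignores: string[] };

const cachedIgnores = new Map<string, Promise<Gitignores>>();
let scans = 0;

/** Hypothetical stand-in for the expensive directory scan. */
async function scanForGitignores(cwd: string): Promise<Gitignores> {
  scans += 1;
  return { ignores: [`${cwd}/node_modules`], unignores: [] };
}

/** Returns the cached result for `cwd`, scanning only on the first request. */
function loadGitignores(cwd: string): Promise<Gitignores> {
  let pending = cachedIgnores.get(cwd);
  if (!pending) {
    pending = scanForGitignores(cwd);
    cachedIgnores.set(cwd, pending);
  }
  return pending;
}

const firstCall = loadGitignores('/repo');
const secondCall = loadGitignores('/repo'); // served from the cache
console.log(firstCall === secondCall, scans);
```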

Contributor Author

phiresky commented Jan 5, 2024

I'm a bit surprised by your results. What are the absolute numbers you're looking at? Is the total more like 100ms or 30s+? Could you also say how many directories in total are in your test dir and how many are gitignored? That @nodelib/fs.walk library is what fast-glob also uses internally so I'm confused why it's not fast.

Slower than before is very surprising because, as I said, the previous method scans through the directory many times: once for each matched plugin, times each package (afaik). Are they very simple repos?

Maybe it's not the library but just the fact that your code isn't using the entryFilter callback I have. Can you try again with a mix of my and your code, pre-add the GLOBAL_IGNORE_PATTERNS, use @nodelib/fs.walk but remove the entryFilter and use the for (const filePath of result.files) { instead that you wrote?

I can benchmark it tomorrow, but just looking at your patch I don't think it's going to work well on my repo. The issue is that you're not building the ignore list incrementally while walking, so the walk tree isn't pruned directly. That's probably also why you need "node_modules" in your pre-created ignore list. With my method it shouldn't matter, because the root gitignore is added to the ignore list before any deeper directories are traversed. I know that pre-adding node_modules to all kinds of ignore lists is a neat hack that tons of JS projects use, but imo it's a bit of a crappy workaround that indicates the code is unnecessarily inefficient, for example if you have tons of build artifacts outside of node_modules. (Try renaming the node_modules directory, moving it to a subdir and gitignoring it there.)

There's a compromise where you parse the root .gitignore first and then walk the rest of the tree non-incrementally. Since most people seem to not even know about .gitignore files in non-root directories it's probably an ok heuristic but it's a bit ugly because then for some cases it's still going to be slow.

Contributor Author

phiresky commented Jan 5, 2024

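You can also try the opposite: add the pruning entryFilter to your tiny-readdir-glob implementation, something like this:

ignore: (entry: string) => {
  // Already covered by a known rule (same micromatch.any arguments as in the patch above): prune.
  if (micromatch.any(...)) return true;
  if (basename(entry) === '.gitignore') {
    // Parse right away and extend the lists, so deeper directories are pruned during the walk.
    for (const rule of parseGitignoreFile(entry, options.cwd)) {
      if (rule.negated) unignores.push(...rule.patterns);
      else ignores.push(...rule.patterns);
    }
  }
  return false;
}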

Collaborator

webpro commented Jan 5, 2024

Here are some numbers. "knip" means this repo at 92c1ac0035e3a63ab4776d224d3d31f96e8bcdc8, "kadena.js" (which the "boost doc" was inspired from) is https://github.com/kadena-community/kadena.js at 06315e20c0b750867965fbe00f0e9babce8bef1b. But that doesn't matter too much, what matters most is the number of invocations and the sums/total time.

I've "timerified" parseFindGitignores and ignore/deepFilter.

tiny-readdir-glob w/ pre-populated ignores

knip

Name                            size  min     max     median  sum
------------------------------  ----  ------  ------  ------  -------
findReferences                   324    0.70  808.35    3.61  2254.22
createProgram                      2  308.56  684.66  496.61   993.21
getTypeChecker                     2   67.09  197.88  132.49   264.97
parseFindGitignores                1   72.27   72.27   72.27    72.27
glob                              24    0.00   14.09    2.08    68.77
getImportsAndExports             384    0.01    3.91    0.06    50.14
ignore                          3587    0.01    2.52    0.01    46.78

Total running time: 3.8s (mem: 454.98MB)

kadena.js

Name                             size   min     max      median  sum
-------------------------------  -----  ------  -------  ------  -------
findReferences                     268    0.06  1804.54   10.30  9157.61
createProgram                        5  294.66  1721.97  528.57  3531.50
glob                               180    0.00   107.27    6.32  1686.14
getTypeChecker                       5   65.18   369.50  110.40   760.86
load                               205    0.00   108.03    0.00   415.60
getImportsAndExports              1412    0.01     8.76    0.08   239.39
parseFindGitignores                  1  190.48   190.48  190.48   190.48
findManifestDependencies            30    0.28    23.04    3.59   151.46
ignore                           13117    0.01     3.17    0.01   143.26

Total running time: 18.1s (mem: 1921.63MB)

tiny-readdir-glob w/o pre-populated ignores

knip

Name                            size   min     max     median  sum
------------------------------  -----  ------  ------  ------  -------
findReferences                    324    0.69  834.38    3.64  2298.06
createProgram                       2  308.51  689.71  499.11   998.21
getTypeChecker                      2   65.45  173.46  119.45   238.91
parseFindGitignores                 1  163.58  163.58  163.58   163.58
ignore                          39560    0.00    2.64    0.00    77.99

Total running time: 3.9s (mem: 461.89MB)

kadena.js

Name                             size    min      max      median   sum
-------------------------------  ------  -------  -------  -------  -------
findReferences                      268     0.06  1718.21    13.54  8820.22
createProgram                         5   374.71  1758.02   495.55  3607.26
glob                                180     0.00    86.44     6.77  1784.92
parseFindGitignores                   1  1037.75  1037.75  1037.75  1037.75
getTypeChecker                        5    66.74   357.90   119.04   770.38
ignore                           230094     0.00     3.38     0.00   518.90

Total running time: 18.8s (mem: 2128.25MB)

@nodelib/fs.walk w/o pre-populated deepFilter

knip

Name                            size  min     max     median  sum
------------------------------  ----  ------  ------  ------  -------
findReferences                   324    0.70  815.89    3.75  2367.43
createProgram                      2  316.89  689.58  503.24  1006.47
parseFindGitignores                1  320.36  320.36  320.36   320.36
deepFilter                       993    0.01    3.88    0.25   298.12
getTypeChecker                     2   67.95  177.24  122.59   245.19
glob                              24    0.00   14.61    1.72    58.66

Total running time: 4.1s (mem: 437.62MB)

kadena.js

Name                             size  min      max      median   sum    
-------------------------------  ----  -------  -------  -------  -------
findReferences                    268     0.06  2014.32    11.72  9068.85
createProgram                       5   318.58  1898.07   528.95  3784.06
parseFindGitignores                 1  2962.00  2962.00  2962.00  2962.00
deepFilter                       1350     0.01    14.65     1.73  2914.62
glob                              180     0.00    84.17     5.77  1525.00

Total running time: 20.8s (mem: 2010.77MB)

Contributor Author

phiresky commented Jan 5, 2024

Quick question: why do getTypeChecker, load and getImportsAndExports not show up in the last run (kadena.js with fs.walk)? Are the results the same?

Collaborator

webpro commented Jan 5, 2024

I've yanked the bottom lines just to reduce noise.

Collaborator

webpro commented Jan 11, 2024

@phiresky I'm very excited to get this merged and include it with the v4 release I'm preparing. Also, I'd really like to thank you for pushing this and attribute the whole pull request 100% to you (that's also why I've pasted the diffs here).

Any chance you could still give it a shot on your workload? Are you in favor of adding my changes to your PR?

Contributor Author

phiresky commented Jan 11, 2024

Sorry, I got sidetracked by other stuff.

Here's a comparison benchmark with your change above:

running on my repo

my branch, unmodified except for timerify(parseFindGitignores):

Name                           size  min      max      median   sum
-----------------------------  ----  -------  -------  -------  --------
findReferences                 2600     1.22  5007.27    11.32  61145.23
createProgram                     3  1119.93  5364.40  2544.73   9029.06
_parseFindGitignores              1  5278.94  5278.94  5278.94   5278.94
getTypeChecker                    3   271.09  1304.57   290.96   1866.63
glob                            192     0.00   118.56     3.94   1202.12

Total running time: 2m 7.1s (mem: 3549.74MB)

your patch:

(I'm sorry, I waited for 15 minutes and it was still in "Connecting the dots...", so I killed it. No results.)

Only after moving out a huge gitignored backup directory from the repo, these are the results:

Name                           size   min     max      median   sum
-----------------------------  -----  ------  -------  -------  --------
findReferences                  2600    1.07  4883.00    10.53  56777.95
createProgram                      3  873.41  4522.90  2457.53   7853.84
glob                             192    0.00   290.43     9.65   3052.22
getTypeChecker                     3  280.92  1305.90   297.33   1884.14
_parseFindGitignores               1  912.77   912.77   912.77    912.77
getImportsAndExports            1678    0.00    16.86     0.26    814.45
ignore                         66244    0.01     8.69     0.01    674.02
load                             236    0.00   267.22     0.00    361.30

Total running time: 5m 28.8s

(I guess most of the time is spent in untracked functions)

running on kadena.js

my branch:

Name                             size  min       max       median    sum
-------------------------------  ----  --------  --------  --------  --------
_parseFindGitignores                1  12242.63  12242.63  12242.63  12242.63
deepFilter                        899      0.01     60.75      6.05  12122.08
findReferences                    285      0.08   1593.18     17.87  12111.61

your patch:

Name                             size  min     max      median  sum
-------------------------------  ----  ------  -------  ------  --------
findReferences                    285    0.07  1650.59   15.96  12145.01
_parseFindGitignores                1  107.67   107.67  107.67    107.67

So yeah, I can reproduce your patch being faster on kadena.js. But it's much slower on my repo with large amounts of gitignored directories.

new version

It's pretty easy to see that parseFindGitignores spends almost all of its time in deepFilter. The only thing that does is call micromatch.any. Looking at that function in micromatch, it's pretty dumb. It constructs a picomatch matcher and then immediately calls it. If we instead construct that picomatch matcher ourselves once per gitignore and use the same instance for every directory, these are the results on kadena.js:
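
The difference between compiling a matcher on every call and compiling it once can be sketched without any dependencies. In the snippet below, `compileMatcher` is a hypothetical, heavily simplified stand-in for picomatch (it only understands `*` and `**`), and `matchAny` mimics the compile-then-call behavior attributed to micromatch.any above. The counter only shows how often the expensive compilation step runs.

```typescript
let compilations = 0;

/** Hypothetical stand-in for picomatch: turns simple globs ('*', '**') into one reusable matcher. */
function compileMatcher(patterns: string[]): (candidate: string) => boolean {
  compilations += 1;
  const sources = patterns.map(glob =>
    glob.replace(/\*\*\/|\*\*|\*|[.+?^${}()|[\]\\]/g, token => {
      if (token === '**/') return '(?:.*/)?'; // zero or more leading directories
      if (token === '**') return '.*';
      if (token === '*') return '[^/]*';
      return `\\${token}`; // escape regex metacharacters
    }),
  );
  const regex = new RegExp(`^(?:${sources.join('|')})$`);
  return candidate => regex.test(candidate);
}

/** Compiles on every call, which is how the comment above describes micromatch.any. */
const matchAny = (candidate: string, patterns: string[]): boolean => compileMatcher(patterns)(candidate);

const ignores = ['**/node_modules', '**/node_modules/**', 'dist/**'];
const dirs = ['src', 'packages/a/node_modules', 'dist/esm'];

const naive = dirs.map(dir => matchAny(dir, ignores));
const compilationsNaive = compilations; // one compilation per directory

const isIgnored = compileMatcher(ignores); // built once, reused for every directory
const reused = dirs.map(dir => isIgnored(dir));
const compilationsReused = compilations - compilationsNaive;

console.log(naive, reused, compilationsNaive, compilationsReused);
```

Both approaches give the same answers; only the number of compilations differs, and that number grows with the count of directories visited in the naive variant.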

Name                             size  min     max      median  sum
-------------------------------  ----  ------  -------  ------  --------
findReferences                    285    0.06  1442.85   14.32  10065.17
_parseFindGitignores                1  213.47   213.47  213.47    213.47

That's still slightly slower but much less significant in the grand scheme of things. I guess micromatch is just a bad wrapper around picomatch.

Now again here's an overview table of my results

| Repo | pruning fs.walk with micromatch | non-pruning tiny-glob | pruning fs.walk with picomatch (this branch now) |
| --- | --- | --- | --- |
| my private one (with large backup directory inside) | 2m5s (5.5s in parseFindGitignores) | [>15min, doesn't finish] | 1m10s (178ms in parseFindGitignores) |
| my private one (only build artifacts) | 2m7s (5.4s in parseFindGitignores) | 5m28s (912ms in parseFindGitignores) | 1m16s (188ms in parseFindGitignores) |
| kadena.js | 31.6s (10s in parseFindGitignores) | 21.3s (110ms in parseFindGitignores) | 18s (205ms in parseFindGitignores) |
| knip repo | 5.4s (400ms in parseFindGitignores) | 4.6s (85ms in parseFindGitignores) | 4.5s (42ms in parseFindGitignores) |

(An additional improvement looks like it comes from updating the isGitIgnoredFn() function to use picomatch as well)

Please try the updated branch and let me know your results. aab7788

Collaborator

webpro commented Jan 11, 2024

Awesome, the latest version is faster in every instance, compared to both the main branch and my patch in this PR, in any repo I try it on 🚀 Best of all worlds :)

Very glad we made it through, thanks again! I'm going to merge this and fix any remaining issues for Windows.

webpro merged commit 288cf45 into webpro-nl:main on Jan 11, 2024
0 of 2 checks passed
Collaborator

webpro commented Jan 11, 2024

Had to fix a few things for Windows and then some, but this is published in v4.0.0-canary.10

Collaborator

webpro commented Jan 11, 2024

Bummer, I knew I forgot something. Linting the https://github.com/TanStack/query repository (already using Knip, with 73 workspaces) went up from 4.7 to 16.1 seconds. Going to dive into that further tomorrow.

Contributor

drodil commented Jan 11, 2024

You could also give backstage repo a try from the root 👌

Collaborator

webpro commented Jan 12, 2024

The regression I mentioned in my last comment has been fixed (and actually improved by a good margin) in 4.0.0-canary.11.

> You could also give backstage repo a try from the root 👌

Latest master of this repo (with the Knip config I provided in that PR, excluding class members or it explodes):

$ knip --exclude classMembers --performance
...
Total running time: 44.9s (mem: 2854.41MB)

Successfully merging this pull request may close these issues.

Knip walks ignored directories hundreds of times
3 participants